Prepare environment

Let's load all the libraries we will use. A handy tool for loading packages is pacman. This package conveniently loads libraries, installs libraries that are not installed, including bioconductor packages!

# install pacman if not already installed
if (!require("pacman")) install.packages("pacman")

# use pacman to load libraries
pacman::p_load(tidyverse,
               DESeq2,
               ggrepel,
               plotly)

Background

We have seen that Spadefoot toad tadpoles have a developmentally fixed dorso-ventral gradient in pigmentation, and that tadpoles can increase or decrease their pigmentation based on the background they are raised on. We now want to explore what are the underlying genetic mechanism that are driving these changes in phenotype. Here, we will jump in around about the middle of a typical RNAseq workflow. We will have already done the following:

  1. Performed the experiment
  2. Extracted the RNA
  3. Sequenced the RNA
  4. Cleaned sequences and performed quality control
  5. Quantified sequences by mapping reads from each library onto our reference genome with salmon
  6. Combined counts from all libraries into a single table using tximport. During this step, we have also calculated 'per-gene' counts. I.e. counts of multiple transcripts per gene have been summarized so that we are left with single estimates per gene.
  7. Minor data filtering to remove mtDNA, non-coding DNA and other irrelevant target sequences, to be left with only protein-coding genes

Our starting point will therefore be the count data for our RNAseq experiment prepared using tximport. In this exercise we will do the following:

  1. Load the RNAseq count data quantified with salmon and explore it.
  2. Visualize count data of biological replicates using a PCA
  3. Perform a differentially expressed genes analysis

Load data

Lets load the count data:

txi<-readRDS("../data/salmon_gene_counts.rds")
str(txi)
## List of 4
##  $ abundance          : num [1:32531, 1:12] 3.17 0 0 0 13.65 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:32531] "PECUL23A000004" "PECUL23A000005" "PECUL23A000006" "PECUL23A000008" ...
##   .. ..$ : chr [1:12] "7D" "7V" "8D" "8V" ...
##  $ counts             : num [1:32531, 1:12] 299 0 0 0 879 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:32531] "PECUL23A000004" "PECUL23A000005" "PECUL23A000006" "PECUL23A000008" ...
##   .. ..$ : chr [1:12] "7D" "7V" "8D" "8V" ...
##  $ length             : num [1:32531, 1:12] 6209.3 71.4 547.6 1019.4 4242.4 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:32531] "PECUL23A000004" "PECUL23A000005" "PECUL23A000006" "PECUL23A000008" ...
##   .. ..$ : chr [1:12] "7D" "7V" "8D" "8V" ...
##  $ countsFromAbundance: chr "no"

For those of you unfamiliar, .rds files is a single R data object, created with saveRDS(). It is particularly useful for saving R objects that are not tabular, or more than 2-dimensional (e.g. lists, functions etc.).

Exploring the loaded data

Question: What does is the class and structure of this data object?

class(txi)
## [1] "list"
str(txi)
## List of 4
##  $ abundance          : num [1:32531, 1:12] 3.17 0 0 0 13.65 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:32531] "PECUL23A000004" "PECUL23A000005" "PECUL23A000006" "PECUL23A000008" ...
##   .. ..$ : chr [1:12] "7D" "7V" "8D" "8V" ...
##  $ counts             : num [1:32531, 1:12] 299 0 0 0 879 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:32531] "PECUL23A000004" "PECUL23A000005" "PECUL23A000006" "PECUL23A000008" ...
##   .. ..$ : chr [1:12] "7D" "7V" "8D" "8V" ...
##  $ length             : num [1:32531, 1:12] 6209.3 71.4 547.6 1019.4 4242.4 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:32531] "PECUL23A000004" "PECUL23A000005" "PECUL23A000006" "PECUL23A000008" ...
##   .. ..$ : chr [1:12] "7D" "7V" "8D" "8V" ...
##  $ countsFromAbundance: chr "no"
lapply(X=txi, FUN=head)
## $abundance
##                       7D       7V        8D        8V       10D      10V
## PECUL23A000004  3.172516  2.49415  1.846965  3.275358  9.258514 10.84399
## PECUL23A000005  0.000000  0.00000  0.000000  0.000000  0.000000  0.00000
## PECUL23A000006  0.000000  0.00000  0.000000  0.000000  0.000000  0.00000
## PECUL23A000008  0.000000  0.00000  0.000000  0.000000  0.000000  0.00000
## PECUL23A000011 13.646505 12.58050 11.662266 12.011888 20.974082 21.07977
## PECUL23A000012  0.000000  0.00000  0.000000  0.000000  0.000000  0.00000
##                      11D       11V      21D       21V       22D      22V
## PECUL23A000004  2.008093  2.847134 10.00756  6.463992  7.079812 11.03426
## PECUL23A000005  0.000000  0.000000  0.00000  0.000000  0.000000  0.00000
## PECUL23A000006  0.000000  0.000000  0.00000  0.000000  0.000000  0.00000
## PECUL23A000008  0.000000  0.000000  0.00000  0.076718  0.000000  0.00000
## PECUL23A000011 13.069257 12.598137 17.31247 15.845097 18.138504 17.05364
## PECUL23A000012  0.000000  0.000000  0.00000  0.000000  0.000000  0.00000
## 
## $counts
##                     7D      7V      8D      8V      10D      10V     11D
## PECUL23A000004 299.200 222.206 172.249 303.135  798.556  950.925 209.388
## PECUL23A000005   0.000   0.000   0.000   0.000    0.000    0.000   0.000
## PECUL23A000006   0.000   0.000   0.000   0.000    0.000    0.000   0.000
## PECUL23A000008   0.000   0.000   0.000   0.000    0.000    0.000   0.000
## PECUL23A000011 879.316 728.819 724.642 713.794 1238.869 1267.971 898.400
## PECUL23A000012   0.000   0.000   0.000   0.000    0.000    0.000   0.000
##                    11V      21D     21V      22D      22V
## PECUL23A000004 313.572  842.375 503.027  675.112  988.542
## PECUL23A000005   0.000    0.000   0.000    0.000    0.000
## PECUL23A000006   0.000    0.000   0.000    0.000    0.000
## PECUL23A000008   0.000    0.000   1.000    0.000    0.000
## PECUL23A000011 933.099 1005.042 858.488 1125.317 1044.139
## PECUL23A000012   0.000    0.000   0.000    0.000    0.000
## 
## $length
##                        7D         7V         8D         8V        10D
## PECUL23A000004 6209.34168 6537.26970 6379.89021 6485.19773 6201.16983
## PECUL23A000005   71.39554   71.39554   71.39554   71.39554   71.39554
## PECUL23A000006  547.63317  547.63317  547.63317  547.63317  547.63317
## PECUL23A000008 1019.39872 1019.39872 1019.39872 1019.39872 1019.39872
## PECUL23A000011 4242.39630 4250.94094 4250.63982 4163.96652 4246.70282
## PECUL23A000012  447.71650  447.71650  447.71650  447.71650  447.71650
##                       10V        11D        11V        21D        21V
## PECUL23A000004 6199.02702 6438.33500 6315.82807 6149.42485 6100.23764
## PECUL23A000005   71.39554   71.39554   71.39554   71.39554   71.39554
## PECUL23A000006  547.63317  547.63317  547.63317  547.63317  547.63317
## PECUL23A000008 1019.39872 1019.39872 1019.39872 1019.39872 1021.78100
## PECUL23A000011 4252.16543 4244.47366 4247.39914 4241.13665 4247.13092
## PECUL23A000012  447.71650  447.71650  447.71650  447.71650  447.71650
##                       22D        22V
## PECUL23A000004 6360.54714 6205.29626
## PECUL23A000005   71.39554   71.39554
## PECUL23A000006  547.63317  547.63317
## PECUL23A000008 1019.39872 1019.39872
## PECUL23A000011 4138.22686 4240.83801
## PECUL23A000012  447.71650  447.71650
## 
## $countsFromAbundance
## [1] "no"
  • It is a list with 4 entries, three arrays/matrices (abundamce, counts, length) and 1 character vector (countsFromAbundance).

Question: What is the content of this object and how is it arranged?

  • All three arrays have target transcripts as rows and biological samples per column.
  • Counts: estimate of the number of reads mapping to each transcript.
  • Abundance: Raw counts cannot be compared across samples because each library may vary slightly in terms of the total number of reads, differences in sequencing bias and difference in transcript lengths. Salmon (the program we used to quantify reads) also produces an array of "Abundances" which are normalized counts. According to the salmon documention, this is Transcripts Per Million (TPM). This basically means that per sample, the total counts add up to 1 million. We could check this:
apply(X=txi$abundance, FUN=sum, MARGIN = 2)
##       7D       7V       8D       8V      10D      10V      11D      11V 
## 933611.3 925431.3 931110.8 939398.7 932355.7 938520.5 945823.7 940997.4 
##      21D      21V      22D      22V 
## 930301.0 934942.5 930828.3 931353.4

In this case, they don't quite add up to 1 million because I have already filtered this matrix, to remove non-coding and mitochondrial DNA.

  • length: effective length of the target transcript.
  • countsFromAbundance: a character vector indicating whether counts were taken directly from the quantifier (salmon) or whether they have been calculated from the abundances by tximport. Default is no (counts are from salmon).

I would argue that it is generally good practice to always plot the distributions of data. This is a good way to get to know your data and also to spot any potentially considerations we have give any analyses we may perform. Lets go ahead and plot the distributions of counts for each sample.

txi$counts %>%
  as_tibble() %>% # turns matrix into tibble object class
  pivot_longer(everything(), names_to="sample", values_to="counts") %>% # turns wide table to long table
  mutate(counts=counts+1) %>% # we are going to plot on a log scale, so lets add 1 to each count so that log10(1) = 0
#  filter(counts>1) %>%
  ggplot(aes(counts)) +
  geom_histogram() +
  ylab("number of genes") +
  scale_x_log10() + # plot x-axis on a log scale
  facet_wrap(~sample) # facet each sample into its own plot

Question: What do these distributions tell us about our count data?

  • Many loci have 0 counts. Why might that be?
  • The distributions for all samples are similar.
  • The data is not normally distributed.

Prefiltering

We will see later on that software for performing differential gene expression is quite good at dealing with non-normal, high zero-count data, but we can make our lives a little easier by removing at least those genes/loci that don't have any mapped reads.

# find rownames where all counts == 0
keepers<-rownames(txi$counts)[rowSums(txi$counts)>0]
length(keepers)
## [1] 21412
# keep all those rows 
txi$counts<-txi$counts[keepers,]
txi$abundance<-txi$abundance[keepers,]
txi$length<-txi$length[keepers,]

#confirm rows have been dropped
dim(txi$counts)
## [1] 21412    12

Adding biological/experimental meta data

To make any sort of inference, we need to add some additional information to the column names that tell us about the sample types and experimental conditions.

Let's load our design matrix. This is a .csv file where we have indicate which sample came from what treatment.

samples<-read.csv("../data/experiment_data.csv")
samples
##   specimen_id background
## 1           7      white
## 2           8      white
## 3           9      white
## 4          10      black
## 5          11      black
## 6          12      black
## 7          21      white
## 8          22      black

This tells us which specimens were subjected to which background colour. We can use some of our Tidyverse knowledge from the previous tutorial to now match this with the column names from the gene expression data.

rna_samples<-tibble(
  sample_id=colnames(txi$counts),
  specimen_id=str_extract(sample_id, "\\d+") %>% as.numeric(),
  position=str_extract(sample_id, "[A-Z]")
) %>%
  left_join(samples)

rna_samples
## # A tibble: 12 × 4
##    sample_id specimen_id position background
##    <chr>           <dbl> <chr>    <chr>     
##  1 7D                  7 D        white     
##  2 7V                  7 V        white     
##  3 8D                  8 D        white     
##  4 8V                  8 V        white     
##  5 10D                10 D        black     
##  6 10V                10 V        black     
##  7 11D                11 D        black     
##  8 11V                11 V        black     
##  9 21D                21 D        white     
## 10 21V                21 V        white     
## 11 22D                22 D        black     
## 12 22V                22 V        black

Question: How many repliactes do we have per tissue type per background?

rna_samples %>%
  group_by(position, background) %>%
  tally()
## # A tibble: 4 × 3
## # Groups:   position [2]
##   position background     n
##   <chr>    <chr>      <int>
## 1 D        black          3
## 2 D        white          3
## 3 V        black          3
## 4 V        white          3
  • There are two tissues (dorsal and ventral skin) and two treatments (black and white). The treatments refer to the background colour that we raised the tadpoles in.
  • There are 3 replicates per tissue per background.

Make a DESeqDataSet object

We can now combine the design matrix and the expression data into a format that DESeq2 likes. tximport plays very well with DESeq2, so this is easy!

# double check column names are the same order
colnames(txi$counts)==rna_samples$sample_id
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# make DESeq object
dds <- DESeqDataSetFromTximport(txi,
                         colData = rna_samples,
                         design = ~ position + background)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors

The dataset should have successfully been stored as a DESeqDataSet object. You should see that it has imported the counts and the length data from the txi object and the sample information from the design matrix. lets take a look at this object:

dds
## class: DESeqDataSet 
## dim: 21412 12 
## metadata(1): version
## assays(2): counts avgTxLength
## rownames(21412): PECUL23A000004 PECUL23A000008 ... PECUL23A062646
##   PECUL23A062647
## rowData names(0):
## colnames(12): 7D 7V ... 22D 22V
## colData names(4): sample_id specimen_id position background

Visualize your count data: PCA

First up, we may be curious as to whether there are some basic patterns in our data. For instance, we would expect biological replicates to cluster together based on their overall expression values. As it stands, we are dealing with a highly dimensional dataset though (each sample has >30k 'variables') and so a dimensional reduction technique like a Principal Component analysis might be a good idea.

Before we go any further however, we need to do a better job at normalizing our count data, than just making counts relative to 1 million per sample. This includes:

  1. Normalizing counts across samples, so that we can compare them more accurately
  2. Normalize counts with respect to library/target size.

This second normalization may require a bit of explanation. You should recall that the dds object contains not only the counts, but also the transcript lengths.

assay(dds, "avgTxLength") %>% head()
##                      7D       7V       8D       8V      10D      10V      11D
## PECUL23A000004 6209.342 6537.270 6379.890 6485.198 6201.170 6199.027 6438.335
## PECUL23A000008 1019.399 1019.399 1019.399 1019.399 1019.399 1019.399 1019.399
## PECUL23A000011 4242.396 4250.941 4250.640 4163.967 4246.703 4252.165 4244.474
## PECUL23A000017 1819.291 1752.458 1718.785 1703.185 1910.097 1954.068 2145.945
## PECUL23A000021 2250.962 2245.367 2256.992 2281.907 2307.002 2320.662 2249.317
## PECUL23A000024 2498.989 2388.780 2391.957 2395.681 2636.826 2553.543 2884.335
##                     11V      21D      21V      22D      22V
## PECUL23A000004 6315.828 6149.425 6100.238 6360.547 6205.296
## PECUL23A000008 1019.399 1019.399 1021.781 1019.399 1019.399
## PECUL23A000011 4247.399 4241.137 4247.131 4138.227 4240.838
## PECUL23A000017 1796.677 1865.823 1916.210 1887.871 2010.417
## PECUL23A000021 2276.333 2261.980 2263.461 2274.141 2318.005
## PECUL23A000024 2438.032 2876.946 2883.781 2878.930 2425.739

This shows that the different targets (i.e. the loci that the RNA-seq reads were mapped to) are of very different lengths. Logically, a larger target is expected to also receive more mapped reads! This can greatly impact our results and needs to be accounted for.

One technique to perform both of these normalizations is to use a Variance Stabilising Transformation. For this we need both the counts and the target lengths, and so conveniently, we can apply this directly to the dds object we have created.

dds_vst<-vst(dds)
assay(dds_vst) %>% head()
##                      7D       7V       8D       8V       10D       10V
## PECUL23A000004 8.408832 8.016173 7.795458 8.508675  9.698203 10.030478
## PECUL23A000008 4.834151 4.834151 4.834151 4.834151  4.834151  4.834151
## PECUL23A000011 9.782295 9.553379 9.594917 9.677563 10.270035 10.397357
## PECUL23A000017 7.847992 8.222655 8.381942 8.201588  8.050771  7.857424
## PECUL23A000021 8.691870 8.828108 9.012648 9.062747  8.649458  8.688747
## PECUL23A000024 5.879286 5.604807 5.993232 6.195822  7.348643  7.774410
##                      11D       11V       21D      21V       22D       22V
## PECUL23A000004  8.126095  8.584872  9.859782 9.247662  9.393828  9.968497
## PECUL23A000008  4.834151  4.834151  4.834151 5.111986  4.834151  4.834151
## PECUL23A000011 10.025821 10.012554 10.065096 9.914462 10.126599 10.018931
## PECUL23A000017  7.592364  7.807522  9.382773 8.858247  8.286428  8.111973
## PECUL23A000021  8.876448  8.818017  8.692063 8.664252  8.707617  8.606508
## PECUL23A000024  5.917729  6.440581  7.098071 5.770632  6.690472  6.443489

We can now pass this transformed counts data to a canned PCA function. We are going to use our Tidyverse knowledge however, to make the plot a bit more informative.

# standard DESeq2 PCA output top 500 most variable genes
plotPCA(dds_vst, intgroup=c("position", "background"), ntop=500)

# run PCA, but output the data, not the plot
pcaData<-plotPCA(dds_vst, intgroup=c("position", "background"), ntop=500, returnData=TRUE)

# extract percentage variance per PC axis
percentVar <- round(100 * attr(pcaData, "percentVar"))

# make a nice pca plot
ggplot(pcaData, aes(PC1, PC2, color=background, shape=position)) +
  geom_point(size=4) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
  coord_fixed() +
  scale_color_manual(values=c("black","white"))+
  theme_dark() +
  ggtitle("PCA on 500 most variable genes")

Question: Can we see any patterns in the PCA biplot?

  • The first axis separates black from white background treatments
  • The second axis separates dorsal from ventral
  • The two axes explain almost the same amount of variance
  • There seems to be one dorsal sample that might be a potential outlier

Differential Expression Analysis

The PCA can be useful to detect any outliers or spot potential problems with our data, but it is not an official statistical test. Our ultimate goal is to see if a given gene is expressed to a different degree in one condition vs. another condition. Ideally we want to do that for all genes at once, and this is the aim of a differential gene expression analysis. At it's core this sounds almost like a T-test kind of problem, but things are not that simple. In this exercise, I don't want to focus too much on the theory, but it is worthwhile to briefly discuss some of the "problematic" features of our data, that would make fitting a simple statistical model difficult. Some of this we have already seen.

The differential expression analysis in DESeq2 takes several steps to try to deal with these problems. At its core it fits generalized linear models with a negative binomial distribution to each gene. It takes into account the estimated counts per gene, normalized by library size, but also takes into account the size of the target genes/transcripts and a dispersion parameter. This dispersion parameter is calculated using all genes, even though the differential expression test is later fitted to only a single gene at a time.

It then performs a pairwise comparison of the level of expression per gene in condition A vs. condition B. It uses the models per gene to estimate coefficients and standard error for each sample group (i.e. for condition A and condition B). These coefficients are the estimates of log2 fold change.

To test whether this log2 fold change for a given gene is significant, it then applies a statistical test (Wald test), to test the hypothesis that the parameter (log 2 fold change) is significantly different from 0. A multiple test correction is then applied, by default a version of the Benjamini-Hochberg False Discovery Rate (FDR). This is done by ranking the genes by p-value, then:

\(adjusted\ p\ value = p\ value * \frac{total\ number\ of\ tests}{rank}\)

In short, DESeq2 does the following:

  1. Estimation of size factors (control for differences in the library size of the sequencing experiments)
  2. Estimation of dispersion (e.g. for problem of large within-group variability and low sample sizes)
  3. Negative binomial GLM fitting (good for modeling discrete counts with over-dispersion)
  4. Wald statistics calculation and multiple testing correction (statistical significance testing)

These are all done with a single convenient function:

dds <- DESeq(dds)
dds
## class: DESeqDataSet 
## dim: 21412 12 
## metadata(1): version
## assays(6): counts avgTxLength ... H cooks
## rownames(21412): PECUL23A000004 PECUL23A000008 ... PECUL23A062646
##   PECUL23A062647
## rowData names(26): baseMean baseVar ... deviance maxCooks
## colnames(12): 7D 7V ... 22D 22V
## colData names(4): sample_id specimen_id position background

Note: it looks like we are "overwriting" the dds object, but really, we are just adding more inforamtion to it

This incredibly convenient function has done all of the heavy lifting for us, and it has even run some pair-wise tests already, based on the ~position + background variables we set.

We can access the names of the pairwise tests like so:

resultsNames(dds)
## [1] "Intercept"                 "position_V_vs_D"          
## [3] "background_white_vs_black"

Examining differential expression analysis results

We can extract the results for the pair-wise tests from the dds object like so:

# dorsal versus ventral
res_pos<-results(dds, name = "position_V_vs_D")
res_pos
## log2 fold change (MLE): position V vs D 
## Wald test p-value: position V vs D 
## DataFrame with 21412 rows and 6 columns
##                   baseMean log2FoldChange     lfcSE       stat    pvalue
##                  <numeric>      <numeric> <numeric>  <numeric> <numeric>
## PECUL23A000004 522.8802192     0.08596494  0.442275  0.1943698  0.845886
## PECUL23A000008   0.0884324     0.52069116  3.116540  0.1670735  0.867312
## PECUL23A000011 950.0207852    -0.05010279  0.127773 -0.3921240  0.694967
## PECUL23A000017 261.7948765    -0.17473465  0.289523 -0.6035266  0.546158
## PECUL23A000021 384.9774416     0.00797549  0.119486  0.0667484  0.946782
## ...                    ...            ...       ...        ...       ...
## PECUL23A062639   12.689338     -0.2622702  0.487949  -0.537495  0.590925
## PECUL23A062642    0.320325     -1.4628909  3.092154  -0.473098  0.636143
## PECUL23A062644    1.174657     -0.8341345  1.190627  -0.700584  0.483563
## PECUL23A062646    7.013169     -0.3679098  0.871624  -0.422097  0.672954
## PECUL23A062647  108.200574     -0.0261807  0.248048  -0.105547  0.915942
##                     padj
##                <numeric>
## PECUL23A000004   0.99974
## PECUL23A000008        NA
## PECUL23A000011   0.99974
## PECUL23A000017   0.99974
## PECUL23A000021   0.99974
## ...                  ...
## PECUL23A062639   0.99974
## PECUL23A062642        NA
## PECUL23A062644        NA
## PECUL23A062646   0.99974
## PECUL23A062647   0.99974
# black vs. white
res_bg<-results(dds, name = "background_white_vs_black")
res_bg
## log2 fold change (MLE): background white vs black 
## Wald test p-value: background white vs black 
## DataFrame with 21412 rows and 6 columns
##                   baseMean log2FoldChange     lfcSE      stat     pvalue
##                  <numeric>      <numeric> <numeric> <numeric>  <numeric>
## PECUL23A000004 522.8802192      -0.685910  0.442276 -1.550866  0.1209338
## PECUL23A000008   0.0884324       0.487485  3.116540  0.156419  0.8757031
## PECUL23A000011 950.0207852      -0.396665  0.127775 -3.104393  0.0019067
## PECUL23A000017 261.7948765       0.726795  0.289528  2.510272  0.0120638
## PECUL23A000021 384.9774416       0.122220  0.119487  1.022874  0.3063675
## ...                    ...            ...       ...       ...        ...
## PECUL23A062639   12.689338      -0.196683  0.487876 -0.403141 0.68684432
## PECUL23A062642    0.320325       1.510412  3.092154  0.488466 0.62521986
## PECUL23A062644    1.174657      -0.820934  1.189680 -0.690046 0.49016535
## PECUL23A062646    7.013169      -0.475565  0.871688 -0.545568 0.58536277
## PECUL23A062647  108.200574       0.651857  0.248119  2.627195 0.00860921
##                     padj
##                <numeric>
## PECUL23A000004 0.3095127
## PECUL23A000008        NA
## PECUL23A000011 0.0184354
## PECUL23A000017 0.0687015
## PECUL23A000021 0.5423588
## ...                  ...
## PECUL23A062639 0.8399464
## PECUL23A062642        NA
## PECUL23A062644        NA
## PECUL23A062646 0.7719239
## PECUL23A062647 0.0537317

These tables have the following information:

  1. Rows are genes.
  2. Per gene, there are 6 metrics:

Question: Is there anything unexpected about the adjusted p-values?

Some of the adjusted p-values are NA. This is what the DESeq2 handbook tells us:

"If a row is filtered by automatic independent filtering, for having a low mean normalized count, then only the adjusted p value will be set to NA. Description and customization of independent filtering is described below".

Let's check some of those counts with NA padj:

counts(dds)[rownames(res_pos)[is.na(res_pos$padj)],] %>% head(15)
##                7D 7V 8D 8V 10D 10V 11D 11V 21D 21V 22D 22V
## PECUL23A000008  0  0  0  0   0   0   0   0   0   1   0   0
## PECUL23A000029  0  0  1  0   0   0   0   0   0   0   0   0
## PECUL23A000056  0  0  0  0   0   0   0   0   2   0   2   2
## PECUL23A000063  0  0  0  0   0   0   0   0   0   0   1   1
## PECUL23A000099  1  0  0  1   0   0   0   3   1   0   1   2
## PECUL23A000126  1  0  0  0   0   0   2   0   1   1   1   0
## PECUL23A000130  1  0  0  0   0   0   0   0   0   0   0   0
## PECUL23A000195  0  0  0  0   0   0   2   0   0   0   0   1
## PECUL23A000197  1  2  0  0   1   2   0   2   6   2   2   0
## PECUL23A000266  0  2  3  0   1   0   3   0   0   0   0   0
## PECUL23A000277  0  0  0  0   1   0   0   0   0   0   0   0
## PECUL23A000289  7 11  6 14  12 268   3  12   8  10   5  14
## PECUL23A000302  0  0  0  0   0   1   0   0   1   1   0   1
## PECUL23A000303  1  2  0  0   0   0   0   0   0   0   0   0
## PECUL23A000313  0  0  0  1   0   0   0   0   0   0   0   0

Indeed! all have very low counts!! they are therefore unreliable, and should rightfully not be included in the results.

We can now summarize these results tables:

summary(res_pos, alpha=0.05)
## 
## out of 21363 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 324, 1.5%
## LFC < 0 (down)     : 342, 1.6%
## outliers [1]       : 160, 0.75%
## low counts [2]     : 4548, 21%
## (mean count < 2)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
summary(res_bg, alpha=0.05)
## 
## out of 21363 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 1537, 7.2%
## LFC < 0 (down)     : 936, 4.4%
## outliers [1]       : 160, 0.75%
## low counts [2]     : 5366, 25%
## (mean count < 3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

This returns the number of up-regulated genes (LFC>0) and down-regulated genes (LFC<0) with respect to the control variable. Other information is also given, such as the number of outliers and low counts.

Question: What do you notice about these results?

  • There are more DEGs when comparing the dark versus light tadpoles than when comparing the dorsal versus ventral skin.

Visualizing results

Volcano plot

A plot resulting from a differential gene expression analysis that you may also have seen is a "Volcano" plot. These show the log fold change plotted against the adjusted p-values per gene. To make small p-values (padj) very large, the p-values are usually -log10() transformed. The most basic plot would look like this:

res_pos %>%
  as_tibble(rownames = "gene_id") %>%
  ggplot(aes(x=log2FoldChange, y=-log10(padj))) +
  geom_point(alpha=0.75, shape=16) +
  ggtitle("Dorsum vs. Ventrum")
## Warning: Removed 4757 rows containing missing values or values outside the scale range
## (`geom_point()`).

In such a plot, each dot represents a gene and the dots with positive fold changes (x-axis) are genes that are over expressed in the treatment versus the control and the genes with negative fold changes are those that are underexpressed. As in this case, we don't always have a clear control versus treatment set up, but DESeq2 will always take second contrast as the control. In our case this was the dorsal skin. This means that all genes on the left of 0 are over expressed in the dorsal skin and all genes to the right of 0 are over expressed in the ventral skin.

The degree of significance (adjusted p-value) is shown on the y-axis. This higher up genes are, the more significant their differential expression.

We can tweak this a little bit to highlight just those genes that are highly differentially expressed. Let's put a p-value threshold at <0.05 and a log fold change threshold at >2.

res_pos %>%
  as_tibble(rownames = "gene_id") %>%
  drop_na(padj) %>% # drop all genes with NAs
  filter(padj<0.5) %>% # reduce the number of points that need to be plotted
  mutate(significant= padj<0.05 & abs(log2FoldChange)>=2) %>% # make a variable to indicate if a gene is significant based on a specific thresholds
  ggplot(aes(x=log2FoldChange, y=-log10(padj), color=significant)) +
  geom_point(alpha=0.75, shape=16) +
  ggtitle("Dorsum vs. Ventrum") +
  theme_bw()

Of course, now we would really like to know what those genes are! So far, we have been working with gene IDs that refer to identified transcripts/genes in the Pelobates cultripes genome assembly, but they have little real meaning to us. How to obtain annotations is a topic in itself and ideally would come from functional studies. However, such studies tend to only be very extensive for model systems and so we often rely in homology. I.e. we will assume that gene sequences with similar amino acid configurations will have similar functions. We can therefore use sequence similarity tools, like BLAST to map genes and their annotations from model organisms (in our case Xenopus tropicalis) onto our genome. Lets import these annotations.

annotations<-read_csv("../data/annotations.csv")
annotations
## # A tibble: 24,392 × 4
##    gene_id        xen_pep_id                     xen_gene_symbol xen_description
##    <chr>          <chr>                          <chr>           <chr>          
##  1 PECUL23A000004 ENSXETP00000113033             kiaa1522        KIAA1522 [Sour…
##  2 PECUL23A000005 ENSXETP00000118270             <NA>            <NA>           
##  3 PECUL23A000006 ENSXETP00000117992             <NA>            <NA>           
##  4 PECUL23A000008 ENSXETP00000100780             pltp            phospholipid t…
##  5 PECUL23A000011 ENSXETP00000042154;ENSXETP000… fxr1            FMR1 autosomal…
##  6 PECUL23A000012 ENSXETP00000115735             clasrp          CLK4-associati…
##  7 PECUL23A000017 ENSXETP00000041999             lyg2            lysozyme G-lik…
##  8 PECUL23A000018 ENSXETP00000105337             <NA>            <NA>           
##  9 PECUL23A000019 ENSXETP00000079593             krt34           keratin 34 [So…
## 10 PECUL23A000021 ENSXETP00000033990;ENSXETP000… pdzd2;znf207    PDZ domain con…
## # ℹ 24,382 more rows

Question: Do all of our genes have homologous genes in X. tropicals? And are they all annotatated? What does this mean?

annotations %>%
  filter(is.na(xen_pep_id)) %>%
  nrow()
## [1] 251
annotations %>%
  filter(is.na(xen_gene_symbol)) %>%
  nrow()
## [1] 5623
  • More than 200 P.cultripes loci did not match any loci in X. tropicals. Miss-assembly? 200 million years of independent evolution?
  • More than 5500 loci have no annotations. This is what is called the dark proteome (porteins with no defined 3D structure). It is estimated that 40-50% of the eukaryote proteome is dark!

Lets add these annotations to our plot. We could do this by adding text label to only our significant genes

res_pos %>%
  as_tibble(rownames = "gene_id") %>%
  drop_na(padj) %>% # drop all genes with NAs
  filter(padj<0.5) %>% # reduce the number of points that need to be plotted
  mutate(significant= padj<0.05 & abs(log2FoldChange)>=2) %>% # make a variable to indicate if a gene is significant based on a specific thresholds
  left_join(annotations) %>% # pull across the extra columns, based on a "gene_id" match
  ggplot(aes(x=log2FoldChange, y=-log10(padj), color=significant, label=xen_gene_symbol)) +# add label text
  geom_point(alpha=0.75, shape=16) +
  geom_text(data= . %>% filter(significant)) + # add text labels for only the significant genes
  ggtitle("Dorsum vs. Ventrum") +
  theme_bw() +
  theme(legend.position="none")
## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_text()`).

This is a little messy. We could try and clean it up by having the text labels repel each other:

res_pos %>%
  as_tibble(rownames = "gene_id") %>%
  drop_na(padj) %>% # drop all genes with NAs
  filter(padj<0.5) %>% # reduce the number of points that need to be plotted
  mutate(significant= padj<0.05 & abs(log2FoldChange)>=2) %>% # make a variable to indicate if a gene is significant based on a specific thresholds
  left_join(annotations) %>% # pull across the extra columns, based on a "gene_id" match
  ggplot(aes(x=log2FoldChange, y=-log10(padj), color=significant, label=xen_gene_symbol)) + 
  geom_point(alpha=0.75, shape=16) +
  geom_text_repel(data= . %>% filter(significant)) + 
  ggtitle("Dorsum vs. Ventrum") +
  theme_bw() +
  theme(legend.position="none")
## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_text_repel()`).
## Warning: ggrepel: 73 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

This is already much better, but geom_text_repel() has decided to drop labels for points that are too close to each other. An interesting alternative is therefore to make this plot interactive, so that labels are only shown when we hover over points. This is easier to do than might think with the javascript package plotly.

# save the plot as a ggplot object
p1<-res_pos %>%
  as_tibble(rownames = "gene_id") %>%
  drop_na(padj) %>% # drop all genes with NAs
  filter(padj<0.5) %>% # reduce the number of points that need to be plotted
  mutate(significant= padj<0.05 & abs(log2FoldChange)>=2) %>% # make a variable to indicate if a gene is significant based on a specific thresholds
  left_join(annotations) %>% # pull across the extra columns, based on a "gene_id" match
  ggplot(aes(x=log2FoldChange, y=-log10(padj), color=significant, label=xen_gene_symbol)) + 
  geom_point(alpha=0.75, shape=16) +
  ggtitle("Dorsum vs. Ventrum") +
  theme_bw() +
  theme(legend.position="none")

# convert the ggplot to a plotly object
ggplotly(p1, tooltip = "label") # define the "label" aesthetic as the hover text

To get back to our story at hand though, lets make one final plot, where we show volcanoes for both comparisons (dorsal versus ventral and black versus white), and lets add only those annotations that are related to the photosensitive melanin synthesis pathways.

# load a reduced annotation file 
melanin_genes<-read_csv("../data/melanin_genes.csv")
melanin_genes
## # A tibble: 14 × 3
##    gene_id        gene_symbol gene_name                               
##    <chr>          <chr>       <chr>                                   
##  1 PECUL23A055459 mc1r        melanocortin-1 receptor                 
##  2 PECUL23A019284 asip        agouti-signaling                        
##  3 PECUL23A040111 mitf        Melanocyte Inducing Transcription Factor
##  4 PECUL23A015118 sox10       SRY-box 10                              
##  5 PECUL23A023288 pax3        paired box 3                            
##  6 PECUL23A000301 tyr         tyrosinase                              
##  7 PECUL23A011713 dct         dopachrome tautomerase                  
##  8 PECUL23A038095 tyrp1       tyrosinase-related protein 1            
##  9 PECUL23A009622 pmel        premelanosome protein                   
## 10 PECUL23A034305 oca2        oculocutaneous albinism II              
## 11 PECUL23A039712 slc24a5     solute carrier family 24 member 5       
## 12 PECUL23A053655 slc45a2     solute carrier family 45 member 2       
## 13 PECUL23A047519 mlph        melanophilin                            
## 14 PECUL23A043421 mlana       melan-A
# combine the two DESeq results tables
res<-bind_rows(res_pos %>%
                 as_tibble(rownames="gene_id") %>%
                 mutate(contrast="Dorsal vs. Ventral"),
               res_bg %>%
                 as_tibble(rownames="gene_id") %>%
                 mutate(contrast="Black vs. White"))
# plot
res %>%
  drop_na(padj) %>% 
#  filter(padj<0.5) %>% 
  mutate(sig= padj<0.05) %>% # lets be a bit more relaxed and not put a threshold on the fold change 
  left_join(melanin_genes) %>%
  ggplot(aes(x=log2FoldChange, y=-log10(padj), color=sig, label=gene_symbol)) +
  geom_point(alpha=0.75, shape=16) +
  geom_text_repel(max.overlaps = Inf) + # set the maximum overlap to infinity, to show all labels.
  xlim(-15,15) + # fix axis width
  facet_wrap(~contrast) + # facet based on contrast
  theme_minimal() +
  scale_color_manual(values=c("grey","black")) +
  theme(legend.position = "none")
## Warning: Removed 32464 rows containing missing values or values outside the scale range
## (`geom_text_repel()`).

Lets look at our gene pathway schematics again and discuss the results.

Question: What genes from these pathways did we recover and are they expressed in a way that we expect?

  • For both contrasts, the darker phenotype shows overexpression of genes related to tyrosine metabolism
  • asip expressed in ventral skin seems to shut down the whole pathway
  • mc1r and the master regulator genes are expressed in the light-exposed contrast, not necessarily the melanized contrast. There seems to be a decoupling of this receptor and downstream regulators to allow for effective crypsis via background matching

Reaction Norms

As a last bonus, to match our phenotype plots, we can also show our expression data as reaction norms. For this, we can use the normalized expression values from the VST matrix.

assay(dds_vst) %>%
  as_tibble(rownames="gene_id") %>%
  left_join(melanin_genes) %>% # add annotations
  drop_na(gene_symbol) %>% # keep only genes with annotations
  select(gene_id, gene_symbol, where(is.numeric)) %>%
  pivot_longer(-c(gene_id, gene_symbol), names_to="sample_id", values_to = "expression") %>%
  left_join(rna_samples) %>% # pull in experimental info about samples
  # instead of calculating means and standard deviations, we will use the stat_summary() layer from ggplot to calculate these on the fly.
  ggplot(aes(x=position, y=expression, color=background)) +
  stat_summary(aes(group = background), fun = mean, geom = "path") +
  stat_summary(aes(color = background), fun.data = mean_cl_boot, geom = "errorbar", width = 0.1) +
  stat_summary(aes(color = background), fun= mean, geom = "point", size = 2) +
  scale_color_manual(values = c("black"="black","white"="grey80")) +
  facet_wrap(~gene_symbol) +
  theme_bw() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) # remove grid lines